Abstract


Intuitively, we expect that averaging (or bagging) different regressors with low
correlation should smooth their behavior and be somewhat similar to regularization. In
this note we make this intuition precise. Using an almost classical definition of stability,
we prove that a certain form of averaging provides generalization bounds with a rate of
convergence of the same order as Tikhonov regularization, similar to fashionable
RKHS-based learning algorithms.
Introduction


Learning from examples can be regarded [8] as the problem of approximating a multivariate
function from sparse data^1. The function can be real valued as in regression or binary valued as
in classification.

The accuracy of the approximated function is based upon its performance on future data,
measured in terms of its generalization error. Given $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ with
underlying probability distribution $P(x, y)$, the generalization error of a function $f$ is

$$I_{exp}[f] = \int V(y, f(x)) \, P(x, y) \, dx \, dy, \tag{1}$$

where $V$ is the loss function (a typical example is the square loss, $V(y, f(x)) = (y - f(x))^2$).
Usually we do not know the distribution $P(x, y)$. We have only the $\ell$ training pairs drawn from
$P(x, y)$, from which we can measure the empirical error

$$I_{emp}[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)). \tag{2}$$
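As a concrete illustration of equation (2), the following minimal sketch computes the empirical error of a fixed function under the square loss. The helper names (`square_loss`, `empirical_error`), the candidate function and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]  # (input x, label y)


def square_loss(y: float, prediction: float) -> float:
    """Square loss V(y, f(x)) = (y - f(x))^2."""
    return (y - prediction) ** 2


def empirical_error(f: Callable[[float], float],
                    data: Sequence[Point],
                    loss: Callable[[float, float], float] = square_loss) -> float:
    """Average loss of f over the training pairs, as in equation (2)."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)


# Hypothetical toy data and candidate function, for illustration only.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]
f = lambda x: 2.0 * x
print(empirical_error(f, data))  # one third: only the middle point is misfit, by 1
```

The generalization error (1) cannot be computed this way, since it requires the unknown distribution $P(x, y)$; this gap is what the stability bounds discussed later control.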

The problem of approximating a function from sparse data is ill-posed and a classical way to
solve it is regularization theory [12]. Regularization theory originates from Tikhonov's classical
approach for solving ill-posed problems. Existence, uniqueness and especially stability^2 can be
restored via a regularizing operator. The basic idea at the heart of the method, as in any
approach to ill-posed problems, is to restrict appropriately the space of solutions $f$ to an
appropriately small hypothesis space^3. Within the universe of ill-posed problems, the problem
of learning theory has a specific need: the derivation of generalization bounds.

1  Definitions 

This section and the next one (stolen from [10]) provide key definitions and theorems. Given an
input space $x \in X \subset \mathbb{R}^d$ and an output space $y \in Y \subset \mathbb{R}$, a training set

$$S = \{z_1 = (x_1, y_1), \ldots, z_\ell = (x_\ell, y_\ell)\},$$

of size $\ell$ in $Z = X \times Y$ is drawn i.i.d. from an unknown distribution $D$. We will refer to a set

$$S^{i,u} = \{z_1, \ldots, z_{i-1}, u, z_{i+1}, \ldots, z_\ell\},$$

where the point $z_i$ in set $S$ is replaced with an arbitrary new point $u$.

Given the training set $S$ we estimate a function $f_S : X \to Y$. The error of this function with
respect to an example $z = (x, y)$ is defined as

$$V(f_S, z) = V(f_S(x), y).$$

^1 There is a large literature on the subject: useful reviews for this paper are [3, 9, 4, 13] and references therein.

^2 Stability is defined as continuous dependence of the solution $f$ on the data, e.g. the approximating
function must vary little with small perturbations of training data.

^3 The Ivanov method restricts the solution $f$ to compact sets defined by $\|f\|_K^2 \le A$ for any positive, finite $A$.
Thus the empirical error of the function is

$$I_{emp}[f_S, S] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(f_S, z_i),$$

where $f_S$ is the function the algorithm selects given a set $S$ and $S$ is the set of points the loss
function is evaluated at. The expected or generalization error is

$$I_{exp}[f_S] = \int V(f_S, z) \, P(z) \, dz = \mathbb{E}_z[V(f_S, z)],$$

where $\mathbb{E}_z[\cdot]$ is the expectation for $z$ sampled from the distribution $D$.

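We state that a loss function $V$ is $\sigma$-admissible if

$$\forall S^1, S^2 \in Z^\ell, \ \forall (x, y) \in Z, \quad |V(f_{S^1}(x), y) - V(f_{S^2}(x), y)| \le \sigma \, |f_{S^1}(x) - f_{S^2}(x)|.$$

This condition was introduced in [1].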
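A short worked example may help make the admissibility condition concrete. Writing $\sigma$ for the admissibility constant, and assuming (an assumption made here only for illustration) that the labels and all predictions are bounded in absolute value by some $B$, the square loss satisfies the condition with $\sigma = 4B$:

```latex
% Assumption for illustration: |y| \le B and |f_1(x)|, |f_2(x)| \le B.
\begin{align*}
\left| (f_1(x) - y)^2 - (f_2(x) - y)^2 \right|
  &= \left| f_1(x) - f_2(x) \right| \, \left| f_1(x) + f_2(x) - 2y \right| \\
  &\le 4B \, \left| f_1(x) - f_2(x) \right| .
\end{align*}
```

Without a bound on the range the square loss is not admissible for any finite constant, which is why boundedness is assumed in this sketch.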

2  Stability:  old  and  new  definitions 

A learning algorithm is a mapping from a training set $S$ to a function $f_S$.

Definition 2.1 (Bousquet and Elisseeff, 2001) [1] An algorithm has stability $\beta$ with respect
to the loss function $V$ if

$$\forall S, S^{i,u} \in Z^\ell, \ \forall z \in Z, \quad |V(f_S, z) - V(f_{S^{i,u}}, z)| \le \beta.$$

Note that $\beta$ will in general depend on $\ell$, so we could more precisely define stability as a function
from the integers to the reals, but the usage will be clear from context. This definition of stability
is known as uniform stability. It is a restrictive condition, as it needs to hold on all possible
training sets, even training sets that can only occur with probability 0. This motivates the weaker
notion of $(\beta, \delta)$-stability.

Definition 2.2 (Kearns and Ron, 1999) [11] An algorithm is $\beta$-stable at $S$ with respect to
a loss function $V$ if

$$\forall z \in Z, \quad |V(f_S, z) - V(f_{S^{i,u}}, z)| \le \beta.$$

Definition 2.3 (Kutin and Niyogi, 2001) [5] An algorithm $A$ is $(\beta, \delta)$-stable with respect to
a loss function $V$ if

$$\mathbb{P}_S(A \text{ is } \beta\text{-stable at } S) \ge 1 - \delta.$$

It is obvious that a $\beta$-stable algorithm is also $(\beta, \delta)$-stable for all $\delta > 0$. The following theorems
provide generalization bounds for $\beta$-stable and $(\beta, \delta)$-stable algorithms.

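Theorem 2.1 (Bousquet and Elisseeff, 2001) [1] Let $A$ be a $\beta$-stable learning algorithm
satisfying $0 \le V(f_S, z) \le M$ for all training sets $S$ and for all $z \in Z$. For all $\epsilon > 0$ and all
$\ell \ge 1$,

$$\mathbb{P}_S\{|I_{emp}[f_S, S] - I_{exp}[f_S]| > \epsilon + 2\beta_\ell\} \le \exp\left(-\frac{2\ell\epsilon^2}{(4\ell\beta_\ell + M)^2}\right).$$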
Theorem 2.2 (Kutin and Niyogi, 2001) [5] Let $A$ be a $(\beta, \delta)$-stable learning algorithm
satisfying $0 \le V(f_S, z) \le M$ for all training sets $S$ and for all $z \in Z$. For all $\epsilon, \delta > 0$ and all
$\ell \ge 1$,

$$\mathbb{P}_S\{|I_{emp}[f_S, S] - I_{exp}[f_S]| > \epsilon + \beta_\ell + M\delta\} \le 2\left(\exp\left(-\frac{\ell\epsilon^2}{8(2\ell\beta_\ell + M)^2}\right) + \frac{4\ell^2 M \delta}{2\ell\beta_\ell + M}\right).$$

In general we are mainly interested in the case where $\beta_\ell = O(1/\ell)$ and $\delta = O(e^{-\ell})$. Throughout
the rest of the paper, when we state that an algorithm is strongly $\beta$- or $(\beta, \delta)$-stable we mean that
$\beta_\ell = O(1/\ell)$ and $\delta = O(e^{-\ell})$.

Using this convention, we note that strongly $\beta$-stable and strongly $(\beta, \delta)$-stable algorithms have
asymptotically identical generalization bounds, with differing constants. In both cases, we have
(via a simple restatement of the theorems) that for any $\tau \in (0, 1)$, with probability $1 - \tau$,

$$|I_{emp}[f_S, S] - I_{exp}[f_S]| \le O\left(\frac{1}{\sqrt{\ell}}\right),$$

which we also refer to as fast convergence.
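The restatement can be sketched as follows for the uniformly stable case. The derivation assumes a tail bound of the exponential form used for uniform stability in [1], and a stability constant $\beta_\ell \le C/\ell$; the constant $C$ is a symbol introduced here for illustration only.

```latex
% Assumed tail bound (uniform-stability form, see [1]):
%   P( |I_emp - I_exp| > \epsilon + 2\beta_\ell )
%     \le \exp( -2 \ell \epsilon^2 / (4 \ell \beta_\ell + M)^2 )
\begin{align*}
\tau = \exp\!\left( -\frac{2 \ell \epsilon^2}{(4 \ell \beta_\ell + M)^2} \right)
  &\iff \epsilon = (4 \ell \beta_\ell + M) \sqrt{\frac{\ln(1/\tau)}{2 \ell}} , \\
\beta_\ell \le \frac{C}{\ell}
  &\implies \left| I_{emp}[f_S, S] - I_{exp}[f_S] \right|
     \le \frac{2C}{\ell} + (4C + M) \sqrt{\frac{\ln(1/\tau)}{2 \ell}}
     = O\!\left( \frac{1}{\sqrt{\ell}} \right)
\end{align*}
% with probability at least 1 - \tau.
```

The key point is that $\ell \beta_\ell$ stays bounded when $\beta_\ell = O(1/\ell)$, so the prefactor of the square root does not grow with the sample size.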

It is interesting that several key learning algorithms are strongly $\beta$-stable^4. In a similar spirit,
we introduce what is an even more restrictive definition and remark that it applies to all cases
considered by Bousquet and Elisseeff.

Definition 2.4 An algorithm has $\alpha$-stability if

$$\forall S, S^{i,u} \in Z^\ell, \ \forall z = (x, y) \in Z, \quad |f_S(x) - f_{S^{i,u}}(x)| \le \alpha. \tag{3}$$

This definition, which corresponds to the classification stability introduced by [1] just for
classification, describes stability of the actual functions. It is closer to the classical definition of
stability as continuous dependence on the initial data. It is clear that (strong) $\alpha$-stability
implies (strong) $\beta$-stability for $\sigma$-admissible loss functions. The converse is not true: in general,
stability with respect to the loss function does not imply stability of the functions, even for
$\sigma$-admissible loss functions (see [10])^5. However, published proofs of $\beta$-stability of various
algorithms [1, 5] first prove that the functions are close in $L_\infty$, then use the $\sigma$-admissibility
condition on the loss function to show $\beta$-stability. For instance, the proof of Theorem 22 of
Bousquet and Elisseeff leads directly to


Theorem 2.3 Let $H$ be a reproducing kernel Hilbert space on a compact domain $X$ with kernel
$K$ such that $K(x, x) \le C_K < \infty$ for all $x$. Let the loss function $V$ be $\sigma$-admissible. The learning
algorithm defined by

$$\min_{f \in H} \frac{1}{\ell} \sum_{i=1}^{\ell} V(f(x_i), y_i) + \lambda \|f\|_K^2 \tag{4}$$

is $\alpha$-stable with

$$\alpha \le \frac{C_K \sigma}{2 \lambda \ell}.$$
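To see the $1/\ell$ behavior of Theorem 2.3 numerically, here is a minimal sketch for the simplest possible case: a one-dimensional linear kernel $K(x, x') = x x'$ on $[0, 1]$ with square loss, so that $f(x) = w x$ and the minimizer of (4) has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda \ell)$. The function names and the random data are hypothetical. For inputs and labels in $[0, 1]$, a direct calculation bounds the change in $w$ after replacing one point by $(\lambda + 2) / (\lambda^2 \ell)$; this elementary bound is specific to this toy setting and is not the constant of the theorem.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (input x, label y), both in [0, 1]


def fit_ridge_slope(data: List[Point], lam: float) -> float:
    """Minimize (1/l) * sum_i (w * x_i - y_i)^2 + lam * w^2 over the slope w."""
    ell = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam * ell)


def replace_one_change(ell: int, lam: float, rng: random.Random) -> float:
    """sup over x in [0, 1] of |f_S(x) - f_{S^{i,u}}(x)| for one random replacement."""
    data = [(rng.random(), rng.random()) for _ in range(ell)]
    perturbed = list(data)
    perturbed[rng.randrange(ell)] = (rng.random(), rng.random())
    # f(x) = w * x, so the sup over x in [0, 1] is attained at x = 1.
    return abs(fit_ridge_slope(data, lam) - fit_ridge_slope(perturbed, lam))


lam = 1.0
rng = random.Random(0)
for ell in (10, 100, 1000):
    change = replace_one_change(ell, lam, rng)
    # The rescaled change stays below (lam + 2) / lam**2 = 3 for lam = 1.
    print(f"ell={ell:5d}  change={change:.6f}  ell*change={ell * change:.4f}")
```

A single random replacement only probes the stability at one training set, so this illustrates the scaling rather than verifying the supremum in Definition 2.4.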


^4 In retrospect this is to be expected since regularization induces continuous dependence on the initial data,
which is a property very similar to $\beta$-stability.

^5 For example, consider the square loss, and the case where $f_1(x) = y + K$ and $f_2(x) = y - K$. The loss of the
two functions is identical, but the functions differ by $2K$ in the $L_\infty$ norm (Rifkin, Calder pers. comm.).
3 Stability of bagging

The intuition is that averaging regressors or classifiers trained on subsamples of a training set
should increase stability with correspondingly better generalization bounds. Note that throughout
this section we assume that the regressors that will be bagged are only $\alpha$-stable, a very weak
assumption. They are not assumed to be strongly $\alpha$-stable. Notice that in the following we are
not making any claims about the empirical error over the whole training set! Consider $N$
regressors $f_j(x)$, each trained on (in general different) subsets of the training set. Each of the subsets
has size $p$; the training set has overall size $\ell$. We call $f_j'$ the regressor corresponding to $f_j$ but
obtained when one of the data points in the whole training set is perturbed. The average bagged
regressor is defined as $\frac{1}{N} \sum_{j=1}^{N} f_j$. It is straightforward to check that if each $f_j$ has $\alpha$-stability $\alpha_p$
then the bagged regressor also has $\alpha$-stability at most $\alpha_p$.
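The check is a one-line application of the triangle inequality. In the sketch below, $f_j'$ denotes the regressor retrained after one training point is perturbed, $\alpha_p$ the $\alpha$-stability of each regressor on a subset of size $p$, and $m \le N$ (a symbol introduced here for illustration) the number of regressors whose training subset contains the perturbed point.

```latex
\begin{align*}
\left| \frac{1}{N} \sum_{j=1}^{N} f_j(x) - \frac{1}{N} \sum_{j=1}^{N} f_j'(x) \right|
  &\le \frac{1}{N} \sum_{j=1}^{N} \left| f_j(x) - f_j'(x) \right| \\
  &\le \frac{m}{N} \, \alpha_p \;\le\; \alpha_p .
\end{align*}
```

The intermediate bound $m \alpha_p / N$ shows where an improvement can come from: the fewer regressors a single point can influence, the more stable the average.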

We now consider a special sampling scheme for bagging: each of the $N$ regressors is trained on
a disjoint subset of the training set. In this case $N = \ell/p$ with $p$ fixed. Only one of the $N$
regressors will be affected by a perturbation of one of the $\ell$ training points. Thus only one of the
terms in $\frac{1}{N} \sum_{j=1}^{N} (f_j - f_j')$ will be different from zero. In this special case the $\alpha$-stability of the
bagged regressor is $\alpha_p / N = p \alpha_p / \ell$. Formalizing this reasoning results in the following theorem.

Theorem 3.1 Consider the bagged regressor $\frac{1}{N} \sum_{j=1}^{N} f_j$ in which each of the $N$ regressors is
$\alpha$-stable for a training set of size $p$. There exist sampling schemes such that the bagged regressor
is strongly $\alpha$-stable with $\alpha$-stability $O(p \alpha_p / \ell)$. Its $\beta$-stability with respect to the $\sigma$-admissible loss
function $V$ is then $O(p \alpha_p \sigma / \ell)$.
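The disjoint sampling scheme behind Theorem 3.1 can be illustrated with a small simulation. The sketch below bags 1-nearest-neighbor regressors, chosen here only as an example of a base learner that is $\alpha$-stable but not strongly so: with labels in $[0, 1]$, replacing one point can change a single 1-NN prediction by up to 1 regardless of the subset size, so $\alpha_p \le 1$. All function names and the random data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (input x, label y), both in [0, 1)


def nearest_neighbor(train: Sequence[Point], x: float) -> float:
    """1-NN regressor: predict the label of the closest training input."""
    return min(train, key=lambda z: abs(z[0] - x))[1]


def disjoint_subsets(data: Sequence[Point], p: int) -> List[Sequence[Point]]:
    """Split the training set into N = len(data) / p consecutive blocks of size p."""
    return [data[i:i + p] for i in range(0, len(data), p)]


def bagged_predict(subsets: List[Sequence[Point]], x: float) -> float:
    """Unweighted average of the regressors trained on each subset."""
    return sum(nearest_neighbor(s, x) for s in subsets) / len(subsets)


def max_change_after_replacement(ell: int, p: int, rng: random.Random,
                                 grid: Sequence[float]) -> float:
    """Largest change of the bagged prediction on the grid after replacing one point."""
    data = [(rng.random(), rng.random()) for _ in range(ell)]
    perturbed = list(data)
    perturbed[rng.randrange(ell)] = (rng.random(), rng.random())
    before = disjoint_subsets(data, p)
    after = disjoint_subsets(perturbed, p)
    return max(abs(bagged_predict(before, x) - bagged_predict(after, x)) for x in grid)


grid = [i / 20 for i in range(21)]
p = 5  # fixed subset size; ell is chosen divisible by p
for ell in (20, 200, 1000):
    change = max_change_after_replacement(ell, p, random.Random(ell), grid)
    print(f"ell={ell:5d}  observed change={change:.5f}  bound p/ell={p / ell:.5f}")
```

Because the blocks are disjoint, the replaced point lies in exactly one block, so the observed change never exceeds $\alpha_p / N = p / \ell$ here even though each individual 1-NN regressor can change by an amount that does not shrink with $\ell$.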

A similar result extends to a simple boosting scheme in which the bagged classifier $\frac{1}{N} \sum_i b_i f_i$ is
a weighted average of the individual regressors, with weights possibly found by optimization on
the training set. We assume that there is an upper bound on the individual weight $b_i$ for all $i$,
i.e. $b_i \le D$, as is the case if the $b_i$ are normalized (i.e. $\sum_i b_i = 1$). This means that the bound
on the weight of each regressor in the resulting boosted function decreases with increasing $N$.
Then the $\beta$-stability of the weighted regressor is $O(p \alpha_p \sigma D / \ell)$.

Now consider a bagging scheme where the subsets chosen for training are not necessarily disjoint.
Consider two variants:

1. If we enforce the constraint that each point belongs to exactly $k$ subsets, then $k$ functions
will be affected by the perturbation of one point. We can train $k\ell/p = kN$ regressors, and,
therefore, have the same bound on $\alpha$-stability for the bagged regressor, $p \alpha_p / \ell$. Note that
the bound on $\alpha$-stability does not change with $k$ for this scheme. It would be interesting
to see empirically how the training error depends on $k$.

2. If we do not impose the above restriction on the number of subsets a given point can
belong to, we might ask a question: If we pick $N = \ell/p$ subsets at random, how many
functions will be affected by perturbation of one point? We can do a probabilistic analysis
of this scheme and use the property of $(\beta, \delta)$-stability defined previously. Note that the
probability of each point being selected for a subset is $p/\ell$, and there are $\ell/p$ subsets, so the
expected number of subsets a given point belongs to is 1. Nonetheless, we were unable
to derive tight exponential (in $\ell$) bounds that would allow us to use $(\beta, \delta)$-stability results
from [6].
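The counting argument in the second variant can be explored with a small simulation. The sketch below assumes one particular reading of "at random", namely that each of the $N = \ell/p$ subsets is drawn independently and uniformly among subsets of size $p$; the function name and parameter values are hypothetical.

```python
import random
from collections import Counter


def membership_counts(ell: int, p: int, trials: int, rng: random.Random) -> Counter:
    """Histogram of how many of the N = ell // p random size-p subsets contain point 0."""
    n_subsets = ell // p
    counts: Counter = Counter()
    for _ in range(trials):
        # Each subset is drawn uniformly without replacement, independently of the others,
        # so a fixed point lands in a given subset with probability p / ell.
        hits = sum(0 in rng.sample(range(ell), p) for _ in range(n_subsets))
        counts[hits] += 1
    return counts


trials = 5000
counts = membership_counts(ell=100, p=5, trials=trials, rng=random.Random(0))
mean_hits = sum(k * c for k, c in counts.items()) / trials
several = sum(c for k, c in counts.items() if k > 1) / trials
print("mean number of subsets containing the point:", mean_hits)
print("fraction of trials with more than one affected regressor:", several)
```

The mean stays close to 1, as the text states, but a non-negligible fraction of draws puts the point in several subsets. Under this sampling model the count follows a binomial distribution, whose tail does not shrink exponentially in $\ell$ when $p$ is fixed, which is consistent with the difficulty reported above.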
4 Remarks


1. If the individual regressors are strongly stable, then bagging does not improve the rate
of convergence, which can be achieved by using only one regressor or by bagging a fixed
number of regressors trained on nonoverlapping training sets of size increasing with $\ell$.

2. In the case of regularization with quadratic loss the stability of the solution depends on the
condition number $\|K + \lambda \ell I\| \, \|(K + \lambda \ell I)^{-1}\| \le \frac{\|K\| + \lambda \ell}{\lambda \ell}$. Thus it is finite for positive $\lambda$ but can
be very large for $\lambda = 0$ and cannot be bounded a priori. In this case, it seems difficult to
show in general that bagging helps. However, in the one-dimensional radial basis functions
case with $\lambda = 0$ we can use results (for instance by Buhmann et al [2, 7]) to show that
bagging  can  give  an  improvement  in  the  condition  number  of  order  0{N^),  where  N  is  the 
number  of  bagged  RBF,  each  trained  on  an  optimal  subset  of  the  data  (intercalated  so  that 
the  distance  between  training  points  for  each  regressor  is  maximal). 

5  Discussion 

The observation captured by Theorem 3.1 implies that there exist bagging and training schemes
providing strong stability to ensembles of non-strongly stable algorithms. Thus bagging has a
regularization effect and provides rates of convergence for the generalization error that are of the
same order as Tikhonov regularization. It would be interesting to extend the previous analysis
to various "boosting" schemes.

Another, probably more interesting issue in many practical situations, is whether and how
bagging can improve stability for a given, fixed size $\ell$ of the training set. At the same time, the
empirical error should also be minimized. Intuitively, the empirical error can be reduced by
increasing the size of the subsamples used to train the individual classifiers; this however tends
to worsen stability.

Acknowledgments:  We  wish  to  thank  Matt  Calder  for  key  suggestions. 